{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning PyTorch with Examples\n",
    "\n",
    "## 1. Tensors\n",
    "\n",
    "### 1.1 Warm-up: numpy\n",
    "\n",
    "省略……"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 PyTorch: Tensors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "dtype = torch.float\n",
    "device = torch.device('cpu')\n",
    "#device = torch.device('cuda:0') # To run on GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimention\n",
    "x = torch.randn(N, D_in, device=device, dtype=dtype)\n",
    "y = torch.randn(N, D_out, device=device, dtype=dtype)\n",
    "w1 = torch.randn(D_in, H, device=device, dtype=dtype) # Default requires_grad=False, indicates that we need to compute gradients manually when backward!\n",
    "w2 = torch.randn(H, D_out, device=device, dtype=dtype)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 1174780.375\n",
      "1 941842.6875\n",
      "2 778356.375\n",
      "3 655291.9375\n",
      "4 558364.0\n",
      "5 479846.90625\n",
      "6 415080.90625\n",
      "7 360968.1875\n",
      "8 315305.8125\n",
      "9 276538.6875\n",
      "10 243410.5\n",
      "11 214961.34375\n",
      "12 190448.0625\n",
      "13 169241.234375\n",
      "14 150804.640625\n",
      "15 134697.859375\n",
      "16 120583.421875\n",
      "17 108177.4921875\n",
      "18 97238.8515625\n",
      "19 87577.5546875\n"
     ]
    }
   ],
   "source": [
    "learning_rate = 1e-6\n",
    "for t in range(20):\n",
    "    # Forward pass: compute predicted y\n",
    "    h = x.mm(w1)               # Tensor.mm表示相乘，类似于numpy.dot\n",
    "    h_relu = h.clamp(min=0)    # Tensor.clamp表示限定数值在一个范围内，截断太大或太小的数值\n",
    "    y_pred = h_relu.mm(w2)\n",
    "    \n",
    "    # Compute and print loss\n",
    "    loss = (y_pred - y).pow(2).sum().item()\n",
    "    print(t, loss)\n",
    "    \n",
    "    # Backward to compute gradients of w1 and w2 with respect to loss\n",
    "    grad_y_pred = 2.0 * (y_pred - y)\n",
    "    grad_w2 = h_relu.t().mm(grad_y_pred)  # Tensor.t表示转置，类似于numpy.T\n",
    "    grad_h_relu = grad_y_pred.mm(w2.t())\n",
    "    grad_h = grad_h_relu.clone()          # Tensor.clone表示复制，类似于numpy.copy\n",
    "    grad_h[h < 0] = 0\n",
    "    grad_w1 = x.t().mm(grad_h)\n",
    "    \n",
    "    # Update weights using gradient descent\n",
    "    w1 -= learning_rate * grad_w1\n",
    "    w2 -= learning_rate * grad_w2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Autograd\n",
    "\n",
    "### 2.1 PyTorch: Tensors and Autograd\n",
    "\n",
    "We can use ***automatic differentiation*** to aumomate the computation of backward passes in neural network. The ***autograd*** package in PyTorch provides exactly this functionality.\n",
    "\n",
    "When using ***autograd***, the forward pass of your network will define a **computational graph**: nodes in the graph will be **Tensors**, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.\n",
    "\n",
    "Each Tensor represents a node in a computational graph. If ***x*** is a Tensor that has ***x.requires_grad=True***, then ***x.grad*** is another Tensor holding the gradient of ***x*** with respect to some scalar value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "dtype = torch.float\n",
    "device = torch.device('cpu')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 刘尧：注意以下Tensors的**requires_grad**，默认是False，表示不需计算其gradient，设置为True，表示当loss.backward()时，会**automatically计算其gradient**！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimention\n",
    "\n",
    "# Create random Tensors to hold input and outputs\n",
    "# Setting requires_grad=False indicates that we don't need to compute gradients with respect to these Tensors during the backward pass\n",
    "x = torch.randn(N, D_in, device=device, dtype=dtype)\n",
    "y = torch.randn(N, D_out, device=device, dtype=dtype)\n",
    "\n",
    "# Create random Tensors for weights\n",
    "# 关键点：Setting requres_grad=True indicates that we want to compute gradients with respect to these Tensors during the backward pass\n",
    "w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)\n",
    "w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 刘尧：注意**loss.backward()**是关键，会**自动计算相关Tensors(w1和w2)的gradient**并分别保存在**w1.grad和w2.grad**中！以供后续手动更新weights时使用\n",
    "\n",
    "> 刘尧：注意**with torch.no_grad()**，因为我们打算手动更新weights，不再需要track history in autograd！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "134.58912658691406\n",
      "128.32373046875\n",
      "122.36589050292969\n",
      "116.69249725341797\n",
      "111.29835510253906\n",
      "106.16795349121094\n",
      "101.28012084960938\n",
      "96.62560272216797\n",
      "92.19682312011719\n",
      "87.97853088378906\n"
     ]
    }
   ],
   "source": [
    "learning_rate = 1e-6\n",
    "for t in range(10):\n",
    "    # Forward pass\n",
    "    y_pred = x.mm(w1).clamp(min=0).mm(w2)  # 不再需要1.2中的中间变量，因为此时我们不再需要手动计算gradients(手动计算才需要这些中间变量)\n",
    "    \n",
    "    # Compute and print loss\n",
    "    loss = (y_pred - y).pow(2).sum()\n",
    "    print(loss.item())\n",
    "    \n",
    "    # 关键点：Use autograd to compute the backward pass. This call will compute the gradient of loss with respect to all Tensors with requires_grad=True\n",
    "    # After this call, w1.grad and w2.grad will be Tensors holding the gradient of the loss with respect to w1 and w2 respectively\n",
    "    loss.backward()\n",
    "    \n",
    "    # 关键点：Manually update weights using gradient descent. \n",
    "    # Wrap in torch.no_grad() because weights have requires_grad=True, but we don't need to track this in autograd\n",
    "    # Alternative way is to operate on weight.data and weight.grad.data, since tensor.data is a tensor that shares the storage but doesn't track history\n",
    "    # We can also use torch.optim.SGD to achieve this\n",
    "    with torch.no_grad():\n",
    "        w1 -= learning_rate * w1.grad\n",
    "        w2 -= learning_rate * w2.grad\n",
    "        # Manually zero the gradients after updating weights\n",
    "        w1.grad.zero_()\n",
    "        w2.grad.zero_()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 PyTorch: Defining New Autograd Function\n",
    "\n",
    "The ***forward function*** computes output Tensors from input Tensors. The ***backward*** function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyReLU(torch.autograd.Function):\n",
    "    \"\"\"Define custom autograd Function by subclassing torch.autograd.Function and implementing the forward and backward passes which operate on Tensors\"\"\"\n",
    "    \n",
    "    @staticmethod\n",
    "    def forward(ctx, input):\n",
    "        \"\"\"ctx: Context object that can be used to stash information for backward. Cache arbitrary objects for backward using ctx.save_for_backward\"\"\"\n",
    "        ctx.save_for_backward(input)\n",
    "        return input.clamp(min=0)  # 实现ReLU\n",
    "    \n",
    "    @staticmethod\n",
    "    def backward(ctx, grad_output):\n",
    "        \"\"\"grad_output: Tensor containing the gradient of the loss with respect to output\"\"\"\n",
    "        input, = ctx.saved_tensors  # 接收forward中ctx.save_for_backward的objects\n",
    "        grad_input = grad_output.clone()\n",
    "        grad_input[input < 0] = 0\n",
    "        return grad_input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "29840608.0\n",
      "25225794.0\n",
      "23254584.0\n",
      "20728966.0\n",
      "17023182.0\n",
      "12410040.0\n",
      "8325976.0\n",
      "5259737.5\n",
      "3318946.75\n",
      "2152646.75\n"
     ]
    }
   ],
   "source": [
    "dtype = torch.float\n",
    "device = torch.device('cpu')\n",
    "N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimention\n",
    "x = torch.randn(N, D_in, device=device, dtype=dtype)\n",
    "y = torch.randn(N, D_out, device=device, dtype=dtype)\n",
    "w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)\n",
    "w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)\n",
    "learning_rate = 1e-6\n",
    "for t in range(10):\n",
    "    relu = MyReLU.apply               # To apply custom Function, we use Function.apply method and alias this as 'relu'  疑问：可不可以放在for循环外面？！？\n",
    "    y_pred = relu(x.mm(w1)).mm(w2)    # Forward pass  代替了原来的：x.mm(w1).clamp(min=0).mm(w2)\n",
    "    loss = (y_pred - y).pow(2).sum()  # Compute loss\n",
    "    print(loss.item())\n",
    "    loss.backward()                   # Backward pass\n",
    "    with torch.no_grad():\n",
    "        w1 -= learning_rate * w1.grad\n",
    "        w2 -= learning_rate * w2.grad\n",
    "        w1.grad.zero_()\n",
    "        w2.grad.zero_()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 TensorFlow: Static Graphs\n",
    "\n",
    "PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is :\n",
    "\n",
    "TensorFlow’s computational graphs are **static** and PyTorch uses **dynamic** computational graphs.\n",
    "\n",
    "In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. \n",
    "\n",
    "In PyTorch, each forward pass defines a new computational graph.\n",
    "\n",
    "> 刘尧：更新weigths这一操作，在Tensorflow中是Computational Graph的一部分，是Static！而在PyTorch中，它发生**在Computational Graph之外，是dynamic**！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "31212168.0\n",
      "24741888.0\n",
      "20985892.0\n",
      "17999220.0\n",
      "16083461.0\n",
      "13378464.0\n",
      "10198802.0\n",
      "7229977.5\n",
      "4866626.5\n",
      "3210210.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "N, D_in, H, D_out = 64, 1000, 100, 10\n",
    "x = tf.placeholder(tf.float32, shape=(None, D_in))  # For input, it will be filled with real data when execution\n",
    "y = tf.placeholder(tf.float32, shape=(None, D_out)) # For target\n",
    "w1 = tf.Variable(tf.random_normal((D_in, H)))       # For weights, initialized with random data\n",
    "w2 = tf.Variable(tf.random_normal((H, D_out)))      # Same as above\n",
    "\n",
    "# Forward pass  Note that these code doesn't actually perform any numeric operations. It merely sets up the computational graph\n",
    "h = tf.matmul(x, w1)\n",
    "h_relu = tf.maximum(h, tf.zeros(1))\n",
    "y_pred = tf.matmul(h_relu, w2)\n",
    "\n",
    "# Compute loss and its gradient with respect to w1 and w2\n",
    "loss = tf.reduce_sum((y - y_pred) ** 2.0)\n",
    "grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])\n",
    "\n",
    "# 关键点：Update the weights  Note that in Tensorflow, it's a part of the computational graph; In PyTorch, it happens outside the computational graph!\n",
    "learning_rate = 1e-6\n",
    "new_w1 = w1.assign(w1 - learning_rate * grad_w1)\n",
    "new_w2 = w2.assign(w2 - learning_rate * grad_w2)\n",
    "\n",
    "# Now we have built computational graph, so we enter a Tensorflow session to actually execute the graph\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())  # Run the graph once to initialize the Variables w1 and w2\n",
    "    \n",
    "    # Create numpy arrays holding the actual data for inputs x and targets y\n",
    "    x_value = np.random.randn(N, D_in)\n",
    "    y_value = np.random.randn(N, D_out)\n",
    "    for _ in range(10):\n",
    "        # Execute the graph many times. Each time it executes, we bind x_value to x and y_value to y, and compute the values for loss, new_w1 and new_w2\n",
    "        loss_value, _, _ = sess.run([loss, new_w1, new_w2], feed_dict={x: x_value, y: y_value})\n",
    "        print(loss_value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. nn Module\n",
    "\n",
    "### 3.1 PyTorch: nn\n",
    "\n",
    "***nn*** package defines a set of **Modules**, which are roughly equivalent to neural network layers.\n",
    "\n",
    "> 刘尧：Module由Layer组成，同时又可像Layer那样看待和使用！ \n",
    "\n",
    "> 刘尧：另外，PyTorch也可像Keras那样**使用Sequential等high-level API来快速构建Model**！谁说不能的来着！？Custom Model见3.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "N, D_in, H, D_out = 64, 1000, 100, 10\n",
    "x = torch.randn(N, D_in)\n",
    "y = torch.randn(N, D_out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = torch.nn.Sequential(\n",
    "    torch.nn.Linear(D_in, H),\n",
    "    torch.nn.ReLU(),\n",
    "    torch.nn.Linear(H, D_out)\n",
    ")\n",
    "loss_fn = torch.nn.MSELoss(reduction='sum')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 2.821704626083374\n",
      "1 2.7032277584075928\n",
      "2 2.5902411937713623\n",
      "3 2.4824397563934326\n",
      "4 2.379688024520874\n",
      "5 2.281726121902466\n",
      "6 2.188166856765747\n",
      "7 2.098848581314087\n",
      "8 2.013753652572632\n",
      "9 1.9325002431869507\n"
     ]
    }
   ],
   "source": [
    "learning_rate = 1e-4\n",
    "for t in range(10):\n",
    "    # Forward pass: Module object override __call__ operator, so we can call it like function\n",
    "    y_pred = model(x)\n",
    "    \n",
    "    # Compute and print loss\n",
    "    loss = loss_fn(y_pred, y)\n",
    "    print(t, loss.item())\n",
    "    \n",
    "    # Zero the gradients before running the backward pass\n",
    "    model.zero_grad()\n",
    "    \n",
    "    # Backward pass: compute gradient of the loss with respect to all the learnable parameters of model\n",
    "    loss.backward()\n",
    "    \n",
    "    # Update the weights: Each parameter is a Tensor, so we can access its gradient using Tensor.grad\n",
    "    with torch.no_grad():\n",
    "        for param in model.parameters():\n",
    "            param -= learning_rate * param.grad"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 PyTorch: optim\n",
    "\n",
    "Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with *torch.no_grad()* or *.data* to avoid tracking history in autograd).\n",
    "\n",
    "This is not a huge burden for simple optimization algorithms like **stochastic gradient descent**, but in practice we often train neural networks **using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc**.\n",
    "\n",
    "> 刘尧：使用optimizer代替之前 with torch.no_grad() 包裹的那一堆代码！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 1.416172381141223e-05\n",
      "1 0.05516074225306511\n",
      "2 0.03499681130051613\n",
      "3 0.025090083479881287\n",
      "4 0.018877189606428146\n",
      "5 0.018316902220249176\n",
      "6 0.018190057948231697\n",
      "7 0.015330985188484192\n",
      "8 0.01237315870821476\n",
      "9 0.01067157182842493\n"
     ]
    }
   ],
   "source": [
    "# N, D_in, H, D_out, x, y, model, loss_fn are same with those in 3.1\n",
    "\n",
    "learning_rate = 1e-4\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # The 1st argument tells which Tensors to be updated\n",
    "for t in range(10):\n",
    "    y_pred = model(x)\n",
    "    loss = loss_fn(y_pred, y)\n",
    "    print(t, loss.item())\n",
    "    model.zero_grad()\n",
    "    loss.backward()\n",
    "    optimizer.step()  # Calling the step function on an Optimizer makes an update to its parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 PyTorch: Custom Modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TwoLayerNet(torch.nn.Module):\n",
    "    \n",
    "    def __init__(self, D_in, H, D_out):\n",
    "        super(TwoLayerNet, self).__init__()\n",
    "        self.linear1 = torch.nn.Linear(D_in, H)\n",
    "        self.linear2 = torch.nn.Linear(H, D_out)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        \"\"\"We can use Modules defined in __init__() as well as arbitrary operators on Tensors\"\"\"\n",
    "        h_relu = self.linear1(x).clamp(min=0)\n",
    "        y_pred = self.linear2(h_relu)\n",
    "        return y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "N, D_in, H, D_out = 64, 1000, 100, 10\n",
    "x = torch.randn(N, D_in)\n",
    "y = torch.randn(N, D_out)\n",
    "model = TwoLayerNet(D_in, H, D_out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 PyTorch: Control Flow + Weight Sharing\n",
    "\n",
    "We implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, **reusing the same weights multiple times to compute the innermost hidden layers**.\n",
    "\n",
    "For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by **simply reusing the same Module multiple times when defining the forward pass**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import torch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DynamicNet(torch.nn.Module):\n",
    "    \n",
    "    def __init__(self, D_in, H, D_out):\n",
    "        super(DynamicNet, self).__init__()\n",
    "        self.input_linear = torch.nn.Linear(D_in, H)\n",
    "        self.middle_linear = torch.nn.Linear(H, H)\n",
    "        self.output_linear = torch.nn.Linear(H, D_out)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        For the forward pass, we reuse the middle_linear Module many times to compute hidden layer representations. \n",
    "        It's perfectly safe to reuse the same Module many times when defining a computational graph. It's a big improvement from Lua Torch.\n",
    "        \"\"\"\n",
    "        h_relu = self.input_linear(x).clamp(min=0)\n",
    "        for _ in range(random.randint(0, 3)):\n",
    "            h_relu = self.middle_linear(h_relu).clamp(min=0)  # 多次使用Module的同一个实例(self.middle_linear)，以实现weight sharing！\n",
    "        y_pred = self.output_linear(h_relu)\n",
    "        return y_pred"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
